A mixed-effects stochastic model reveals clonal dominance in gene therapy safety studies

Background Mathematical models of haematopoiesis can provide insights on abnormal cell expansions (clonal dominance), and in turn can guide safety monitoring in gene therapy clinical applications. Clonal tracking is a recent high-throughput technology that can be used to quantify cells arising from a single haematopoietic stem cell ancestor after a gene therapy treatment. Thus, clonal tracking data can be used to calibrate the stochastic differential equations describing clonal population dynamics and hierarchical relationships in vivo. Results In this work we propose a random-effects stochastic framework that allows to investigate the presence of events of clonal dominance from high-dimensional clonal tracking data. Our framework is based on the combination between stochastic reaction networks and mixed-effects generalized linear models. Starting from the Kramers–Moyal approximated Master equation, the dynamics of cells duplication, death and differentiation at clonal level, can be described by a local linear approximation. The parameters of this formulation, which are inferred using a maximum likelihood approach, are assumed to be shared across the clones and are not sufficient to describe situation in which clones exhibit heterogeneity in their fitness that can lead to clonal dominance. In order to overcome this limitation, we extend the base model by introducing random-effects for the clonal parameters. This extended formulation is calibrated to the clonal data using a tailor-made expectation-maximization algorithm. We also provide the companion package RestoreNet, publicly available for download at https://cran.r-project.org/package=RestoreNet. Conclusions Simulation studies show that our proposed method outperforms the state-of-the-art. The application of our method in two in-vivo studies unveils the dynamics of clonal dominance. Our tool can provide statistical support to biologists in gene therapy safety analyses. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05269-1.


.1 Stochastic quasi-reaction networks
Stochastic quasi-reaction networks (S-QRNs) allow to implement a particular class of stochastic differential equations that can be used to model biochemical reactions. More formally, let y y y t = (y 1t , . . . , y nt ) ′ ∈ N n 0 (1) be a collection of molecules of n different types observed at time t, and consider K distinct (and competing) reactions r j1 y 1 + · · · + r jn y n θj → p j1 y 1 + · · · + p jn y n , j = 1, . . . , K each occurring with its own rate θ j . The coefficients r ji 's defining the left-side of the reaction are called reagents and represent the minimum amount of molecules of type i needed for the j-th reaction to occur. Similarly, the coefficients p ji defining the right-side of the reaction are called products and represent the amount of produced molecules of type i after the j-th reaction is triggered. We assume that, if we observe y y y 0 = (r j1 , . . . , r jn ) ′ molecules at time t = 0, the j-th reaction will occur after T j ∼ Exp(θ j ) , j = 1, . . . , K , Namely, if exactly r ij molecules of each type i would be present, then the j-th reaction can only take place in one way, with the exponential hazard rate θ j . The interpretation is that, after a waiting time T j , r ji molecules of type i collide with each other and produce p ji molecules of type i (∀i = 1, . . . , n), while the molecules move randomly in a hosting "cellular" environment. However, in general at time t = 0 we might observe Y i0 ≥ r ji molecules of each type i and, therefore, the j-th reaction can take place in a combinatorial number of ways leading to the following waiting time formulation where is the vector parameter for the reaction rates, and is the j-th hazard rate. In this case, the effect will be that at time t + T j we have the following expression for the number of molecules of substrate i, where v ji = p ji − r ji is the j-th net effect. More compactly, for a set of K reactions and n species, the molecular transfer from reagent to product species is a net change of where P P P = [p ji ] ′ denotes the n × r dimensional matrix of products, R R R = [r ji ] ′ is the n × r dimensional matrix of reactants, and V V V = [v ji ] ′ is an n × r dimensional matrix called net-effect matrix . Therefore, a S-QRN of K-distinct reactions is fully identified by a net-effect matrix V V V and by the hazard vector h h h(y y y, θ θ θ) = (h 1 (y y y, θ θ θ), . . . , h K (y y y, θ θ θ)) ′ . (9)

Simulating a trajectory of molecules
A τ -leaping algorithm is an alternative method to a Gillespie algorithm for simulating triggering-chain events. Instead of simulating a waiting time for the first reaction to occur and selecting the corresponding winner reaction, a τ -leaping algorithm simulates the number of occurrences of each possible event after a time-lag equal to τ elapsed. Formally, let {N r (t)} t≥0 be an inhomogeneous Poisson point process representing the number of reactions of type r that took place up to (and including) time t. Therefore The last equation gives an estimate of the expected number of reactions of type r that took place within the time interval [t, t+∆ t [. Therefore, the expected number of molecules y y y t+∆t at time t + ∆t given the current number of molecules y y y t can be easily obtained by adding to y y y t the product between the expected number of events E[N r (t + ∆t) − N r (t)] that have happened in the time interval [t, t + ∆ t [, the corresponding net-effect, and the number of ways that reaction can occur, leading to The pseudocode of the τ -leaping algorithm (τ -LA) is reported in Algorithm 1.

The Master Equation
In practice it is common that the reaction rates of a stochastic reaction network are unknown, and the goal is to estimate them given a collected dataset. In order to estimate the rates θ θ θ = (θ 1 , . . . , θ K ) ′ using a likelihood-based approach, we need to define an underlying probabilistic model. One of the most natural choices for describing stochastic chemical kinetics of Eqs. (2)-(9) is the chemical master equation dP (y y y; t) dt = K j=1 {h j (y y y − V ·j ; θ θ θ)P (y y y − V ·j ; t) − h j (y y y; θ θ θ)P (y y y; t)} , Algorithm 1: τ -leaping algorithm Input: S (no. simulations), y y y 0 (initial state), τ (time lag), θ(t) (reaction rates) Output: {y y y t } t t ← 0; y y y t ← y y y 0 ; consistently with Eq. (4). It describes the temporal evolution of the probability density function P (y y y; t) of the state vector y y y of the chemical system (2). Roughly speaking, the first part of the right-hand side of Eq. (12) models all the reactions letting the state out of k(̸ = j), whereas the second part models all the reactions which brings the state back to k. It is often the case that the Master equation is computationally intractable, especially when the state vector y y y is high-dimensional, so that the number of possible states the system may occupy is too large. Several approximations of the Master equation exist [1,2], and here we describe a procedure for "continuizing" the discrete-state chemical Markov process defined by Eqs.
Theorem 1. Assume that h j (x x x; θ θ θ)P (x x x; t) are analytical functions in x x x. Then, a second order Taylor expansion of the products h j (x x x − V ·j ; θ θ θ)P (x x x − V ·j ; t) around x x x leads to the Ito-type stochastic differential equation called the Kramers-Moyal approximation where the drift function and the dispersion matrix are given by Proof. The analytical assumption of h j (x x x; θ θ θ)P (y y y; t) in x x x allows us to consider a second- and therefore and by plugging it in the Master equation (12) we have which we recognize as a Kolmogorov forward (Fokker-Plank) equation with drift function which completes the proof.

Euler-Maruyama approximation
Using previous results and some linear algebra, the approximated Ito equation (14) can be further approximated as . . . h1(y y yt,θ θ θ) ∆ε ε ε t ∼ N (0 0 0, I I I n ) , or more compactly where we included the term σ 2 I I I N so as to prevent singularity of the diffusion term, and to additionally explain noise variance. In practice, since we collect only discrete-time increments ∆y y y t = y y y t+∆t −y y y t , we consider an Euler-Maruyama local linear approximation (LLA) of the approximated Ito equation. Indeed we also replaced the infinitesimal increments dt and dy y y t with the discrete increments ∆t and ∆y y y t . Then, all the timespecific blocks can be stacked together obtaining the full generalized linear model (GLM) formulation . . .
which is convenient for parameters inference.

Maximum Likelihood (ML)
We infer the parameters (θ θ θ, σ 2 ) of Eq. (19) with a maximum likelihood approach, that is we solve the following constrained optimization problem where the objective function is and we compactly write the diffusion matrix W W W * = W W W (θ θ θ, σ 2 ) as a function of the free parameters. Using the rules of matrix calculus [3], the partial derivatives of f w.r.t. θ θ θ and σ 2 can be written as where Then, we solve the optimization problem (20) by using the objective function (21) and its gradients (22)-(23) inside the L-BFGS-B optimization algorithm from the optim() function of the stats R package. The inference procedure is summarised in Algorithm 2.

Random-effects stochastic reaction networks
From Eq. (19) it can be seen that all the molecules y 1 , . . . , y n share the same parameter vector θ θ θ. In some cases it may happen that the molecules being analysed are drawn from a hierarchy of J different populations having different properties. In this case it might be of interest to quantify the population-average θ θ θ and the subject-specific effects u u u around the average θ θ θ for the description of the subject-specific dynamics. Therefore, to quantify the contribution of each subject j = 1, . . . , J on the process's dynamics we extended the LLA formulation of Eq. (19) by introducing random effects u u u for the J February 22, 2023 5/9 distinct subjects on the parameter vector θ θ θ, leading to the following mixed-effects [4] formulation where M M M is the block-diagonal design matrix for the random effects u u u centered in θ θ θ, and each block M M M j is subject-specific. As in the case of the null model of Eq. (19), to explain additional noise of the data and to avoid singularity of the stochastic covariance matrix W W W (θ θ θ) we added to its diagonal a small unknown quantity σ 2 which we infer from the data. In order to infer the maximum likelihood estimatorψ ψ ψ for ψ ψ ψ = (θ θ θ, σ 2 , τ 2 1 , . . . , τ 2 p ) we developed an efficient expectation-maximization E-M algorithm where ∆y y y and u u u take the roles of the observed and latent states respectively. Under this framework p(u u u|∆y y y) ∝ u u u p(∆y y y|u u u)p(u u u) and therefore u u u|∆y y y ∼ N Jp (E u u u|∆y y y;ψ ψ ψ [u u u], V u u u|∆y y y;ψ ψ ψ (u u u)) , Also, the joint log-likelihood of ∆y y y and u u u is given by l(∆y y y, u u u; ψ ψ ψ) ∝ ψ l(∆y y y|u u u; ψ ψ ψ) + l(u u u; ψ ψ ψ) which only depends on u u u linearly via its first two-order conditional moments of Eq. (28). Therefore, it follows for the E-step function that Q(ψ ψ ψ|ψ ψ ψ * ) = E u u u|∆y y y;ψ ψ ψ * [l(∆y y y, u u u; ψ ψ ψ)] = − 1 2 log|Σ Σ Σ(θ θ θ, σ 2 )| February 22, 2023 6/9 The gradient of Q(ψ ψ ψ|ψ ψ ψ * ) is defined by the following partial derivatives +E u u u|∆y y y;ψ ψ ψ * [u u u]E u u u|∆y y y;ψ ψ ψ * [u u u] ′ + +E u u u|∆y y y; In the E-M algorithm we iteratively update the E-function Q(ψ ψ ψ|ψ ψ ψ * ) using the current estimate ψ ψ ψ * of ψ ψ ψ and then we minimize the −Q(ψ ψ ψ|ψ ψ ψ * ) w.r.t. ψ ψ ψ. The E-M algorithm is run until a convergence criterion is met, that is when the relative errors of both the E-step function Q(ψ ψ ψ|ψ ψ ψ * ) and the vector parameter ψ ψ ψ are lower than a predefined tolerance. Once we get the E-M estimateψ ψ ψ for the parameters we evaluate the goodness-of-fit of the mixed-model according to the conditional Akaike Information Criterion [5]. As every E-M algorithm, the choice of the starting point ψ ψ ψ s is very important from a computational point of view. We chose as a starting point ψ ψ ψ s = (θ θ θ M L ,σ 2 M L , τ 2 1 = 0, . . . , τ 2 p = 0) where (θ θ θ M L ,σ 2 M L ) is the optimum found in the fixed-effects LLA formulation of Eq. (19). This is a reasonable choice since we want to quantify how the dynamics E u u u|∆y y y;ψ ψ ψ [u u u] j of each subject j departs from the average dynamicsθ θ θ M L . The E-M pseudocode is given in Algorithm 3.

Data rescaling 2.1 Rhesus macaque study
Although the sample DNA amount was maintained constant during the whole experiment (200 ng for ZH33 and ZG66 or 500 ng for ZH17), the sample collected resulted in different magnitudes of total number of reads. a regression approach. More precisely, we first perform a log-link Poisson regression on the collected cell counts y y y against the corresponding confounding factors, leading to the following model log(λ λ λ) = X X Xβ β β , y i ∼ Poisson(λ i ) , where y i is the i-th component of y y y, λ i is the i-th component of λ λ λ for i = 1, . . . , n, X X X = 1 1 1 X X X c is the full design matrix including a term 1 1 1 ∈ R n×1 for the intercept and a term X X X c ∈ R n×4 with confounder-specific columns. After having estimated the parametersβ β β = β 0 ,β β β ′ c ′ with a Fisher scoring algorithm, the rescaled clonal tracking data has been defined as the partial residuals corresponding to the confounders, that is y y y res = exp log(y y y) − X X X cβ β β c , whereβ β β c are the optimal parameters for the confounders.